In this notebook, we will explore our preprocessed dataset. We will analyze the data for outliers, correlated features, and the merits of individual features relative to those our models select. We will use different feature selection methods such as PCA, Forward/Backward selection, and Decision Trees, and will determine which features we can exclude based on the results and error rate. We want to train our models only on the features most important to determining the popularity of a song per genre (with the exception of Decision Trees, since they perform their own feature selection).

We also use kMeans to explore how well it can predict a song's popularity based on its features, and perhaps even cluster songs into genres based on each track's features.

This data has been preprocessed and does not contain null values.

Analyze data for outliers

We have a total of 17,473 track objects, 5,654 of which are our 'popular' (aka target) tracks. Track counts per genre fall within a few standard deviations of one another, with an average of 4,625 tracks per genre. The pop genre has the lowest count with 4,158 tracks; the r&b genre has the highest at 5,233 tracks.

Analyze data for correlated features

We didn't find many features significantly correlated with "popularity" other than "chartrank", but that correlation is expected and therefore not insightful in this case. The only other significant correlation to note across all the data is between "energy" and "loudness".
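As a rough sketch of the correlation check, the pairwise Pearson correlation matrix can be computed with pandas. The data below is synthetic (random values standing in for the real track features, with "loudness" deliberately tied to "energy" to mimic the correlation we observed); the column names only mirror those discussed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
energy = rng.random(100)
# Hypothetical stand-in for the preprocessed track data.
tracks = pd.DataFrame({
    "energy": energy,
    "loudness": energy + rng.normal(0, 0.1, 100),  # correlated by construction
    "popularity": rng.integers(0, 100, 100),
})

# Pairwise Pearson correlations; strong off-diagonal values flag
# feature pairs such as ("energy", "loudness").
corr = tracks.corr()
print(corr.round(2))
```

Running `tracks.corr()` per genre subset (e.g. `tracks[tracks["genre"] == "jazz"].corr()`) is how the per-genre correlations below would be obtained.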

Let's see if we can find more correlations when we separate the data by genre.

Since we know the ("popularity", "chartrank") & ("energy", "loudness") correlations are present everywhere, we will omit further mention of them in the analysis of the genre feature correlations below.

The Jazz genre's "valence" & "danceability" features are somewhat correlated. There's a correlation between the "valence" & "energy" features; "energy" & "acousticness"; "loudness" & "acousticness"; and "time_signature_3" & "time_signature_4".

Latin genre had a correlation between "time_signature_3" & "time_signature_4".

Pop genre correlations: "energy" & "acousticness"; "time_signature_3" & "time_signature_4".

R&B genre had a correlation between "time_signature_3" & "time_signature_4".

Country genre correlation between the "valence" and "energy" features; "energy" & "acousticness"; "loudness" & "acousticness"; "time_signature_3" & "time_signature_4".

Classification

Use Feature Selection methods to determine which features are most important to determining popularity of the song per genre

Now that we've analyzed the correlations in our features, let's separate the target feature from the training set.

Since we know chart rank is correlated with popularity, and we want to base our predictions on song features, we will drop the 'chartrank' column from our training data.

Training data will consist of songs prior to 2019, a total of 20,618 tracks. Test data will consist of songs in 2019 & 2020, summing up to 2,509 tracks. Roughly 12% of our raw data is for testing.
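The year-based split can be sketched as follows. The frame here is synthetic and the column names ('year', 'chartrank', 'popular') are assumptions standing in for the real dataset's schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical stand-in for the preprocessed track data.
df = pd.DataFrame({
    "year": rng.integers(2010, 2021, 500),
    "chartrank": rng.integers(1, 200, 500),
    "energy": rng.random(500),
    "popular": rng.integers(0, 2, 500),
})

# Drop 'chartrank' (correlated with the target), then split by year:
# pre-2019 tracks train the models, 2019-2020 tracks test them.
df = df.drop(columns=["chartrank"])
train = df[df["year"] < 2019]
test = df[df["year"] >= 2019]
X_train, y_train = train.drop(columns=["popular"]), train["popular"]
X_test, y_test = test.drop(columns=["popular"]), test["popular"]
```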

Multiple methods will be implemented to test the parameters that yield the highest accuracy per model, as well as to visualize the cross-validation score at a more granular, per-parameter level.

Decision Trees

One of the easiest and most efficient ways to determine feature selection is by using Decision Trees. We will start by analyzing all the features for all the tracks. Then, we will test different parameters and will use the parameters that yield the highest accuracy on our model. Finally, we will see if we could improve our accuracy by repeating this process on a more granular level - separating tracks by genre.
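The parameter search described above can be sketched with scikit-learn's GridSearchCV. The data here is random and the parameter ranges are placeholders, not the notebook's actual values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.random((300, 5))          # synthetic stand-in for the audio features
y = rng.integers(0, 2, 300)       # synthetic 'popular' labels

# Exhaustively cross-validate combinations of the three parameters
# discussed in this section.
param_grid = {
    "max_depth": range(1, 8),
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 25],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```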

Though we were able to get the best parameters all at once, let's plot one of those parameter ranges to

  1. better understand the reasoning behind our model's selection; and
  2. confirm there aren't better parameters to choose from.

For now, let's plot the cross-validation against the max depth parameter range.

Later we'll plot the other 2 parameter ranges: min samples split & min samples leaf.
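The per-parameter curve can be sketched by cross-validating one depth at a time (synthetic data again; plotting is left as a comment so the sketch stays self-contained).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.random((200, 5))
y = rng.integers(0, 2, 200)

# Mean 5-fold cross-validation accuracy at each candidate max_depth.
depths = range(1, 11)
cv_scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                             X, y, cv=5).mean()
             for d in depths]
# cv_scores can then be plotted against depths to locate the peak;
# the same loop works for min_samples_split and min_samples_leaf.
```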

The plot suggests that a max depth of 1 is sufficient for returning the highest accuracy. We need to find out which feature the DT is splitting on and analyze its merits. Let's create a Decision Tree model with the suggested min_samples_split & min_samples_leaf values, & set the max_depth at the 3rd highest accuracy rate to get more variance. Then, we'll analyze the classification report to determine how well our model is doing in both precision and recall using those "best" parameters.
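Fitting with the chosen parameters and printing the report can be sketched as below; the parameter values and data are placeholders, not the notebook's tuned values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(4)
X_train, y_train = rng.random((300, 5)), rng.integers(0, 2, 300)
X_test, y_test = rng.random((100, 5)), rng.integers(0, 2, 100)

# Placeholder parameters standing in for the grid search's picks.
dt = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                            min_samples_leaf=5, random_state=0)
dt.fit(X_train, y_train)

# Per-class precision, recall, and f1 on the held-out 2019-2020 tracks.
report = classification_report(y_test, dt.predict(X_test))
print(report)
```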

As we can see from our low recall score for classifying 'popular', our decision tree model performs worse than random guessing at predicting popular track features. The 71% accuracy stems from its better ability to predict unpopular tracks.

Let's see which features were used in our Decision Tree model.

Based on all the genres and all the features, our Decision Tree selected 'loudness' as the most informative feature to split on. In the 2nd level, it split on 'acousticness' and 'year'. We can dismiss the 'year' feature since our data aims to predict popularity based on audio features. At the 3rd level, it splits on 'explicit_1', 'danceability', and 'loudness' again.
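The split structure can be inspected with export_text and feature_importances_. The example below is synthetic: the target is deliberately tied to the first column so the tree splits on "loudness", mimicking (not reproducing) the notebook's result.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(5)
feature_names = ["loudness", "acousticness", "year", "danceability", "explicit_1"]
X = rng.random((300, 5))
y = (X[:, 0] > 0.5).astype(int)   # synthetic target driven by 'loudness'

dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text rendering of the fitted splits, plus each feature's importance.
print(export_text(dt, feature_names=feature_names))
print(dict(zip(feature_names, dt.feature_importances_.round(3))))
```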

Using these best params, we can next test on max_features and assign a min_impurity_decrease value to help prune off features that don't return more information.
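A hypothetical refinement pass might look like the following: the earlier best parameters are held fixed (placeholder values here) while max_features and min_impurity_decrease are searched.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.random((300, 5))
y = rng.integers(0, 2, 300)

# Second-stage grid: limit candidate features per split and prune
# splits that don't reduce impurity enough.
grid = {"max_features": [None, "sqrt", 3],
        "min_impurity_decrease": [0.0, 0.001, 0.01]}
search = GridSearchCV(
    DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                           min_samples_leaf=5, random_state=0),
    grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```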

Let's see if our Decision Tree model improves if we fit it on a per-genre basis. We will re-evaluate the previous parameters in order to better fit our smaller datasets. First, we'll observe each parameter range individually for cross-validation scores. Then we'll test on all the parameters to get the best combination of parameters.

Overall, our accuracy only improved in the Jazz, Pop, & R&B genres compared to the DT model based on all audio features across all genres. Only in the R&B genre did we improve our recall score, which was still no better than guessing.

K Nearest Neighbors

First, we'll test on all the tracks for all genres using K Nearest Neighbor as our classifier with Euclidean distance as our distance metric.

Then, we will do a grid search on the best 'k'.

Later, we will analyze the classifier more granularly, on a genre-by-genre case.

Let's evaluate the best number of nearest neighbors to optimize our knn prediction.

The elbow is at roughly 7 neighbors.
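The elbow search can be sketched by cross-validating over a range of k values (synthetic data; the plot step is left as a comment).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.random((200, 5))
y = rng.integers(0, 2, 200)

# Mean cross-validation accuracy for each candidate neighbor count,
# using Euclidean distance as in the text.
k_values = range(1, 21)
scores = [cross_val_score(
              KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
              X, y, cv=5).mean()
          for k in k_values]
# Plotting scores vs. k_values reveals the elbow (~7 on the real data).
```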

Let's evaluate our classifier's accuracy on its 2019-2020 popularity predictions based on audio features:

Though our f1-score is worse than the f1-score of our decision tree, our recall for predicted popularity is much higher!

Backward Selection

Since we don't have too many features, we'll start with a Backward-SFS to find the features that give us the most information when attempting to predict popularity. We'll use the K Nearest Neighbors classifier as the underlying model, using the elbow found over the range of k values above to parameterize the Backward Feature Selection model.

We were unable to use the SequentialFeatureSelector method because it was introduced in scikit-learn version 0.24; our Anaconda environment runs version 0.23.3, with no option to update.
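For environments running scikit-learn 0.24 or later, the intended backward selection would look roughly like this (synthetic data; n_features_to_select is a placeholder):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = rng.random((150, 6))
y = rng.integers(0, 2, 150)

# Backward SFS: start from all features and greedily drop the one
# whose removal hurts cross-validated accuracy the least.
sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=7),
                                n_features_to_select=3,
                                direction="backward", cv=3)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the surviving features
```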

https://scikit-learn.org/stable/modules/feature_selection.html

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector

Naive Bayes

Training on tracks across all genres for popularity

Training on tracks per genres for popularity prediction

Training on tracks per genres for genre prediction
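The Naive Bayes fits above can be sketched with GaussianNB, which suits continuous audio features; data here is synthetic and the recall computation mirrors the metric this section focuses on.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

rng = np.random.default_rng(9)
X_train, y_train = rng.random((300, 5)), rng.integers(0, 2, 300)
X_test, y_test = rng.random((100, 5)), rng.integers(0, 2, 100)

# Fit on pre-2019 tracks, score recall on the held-out tracks.
nb = GaussianNB().fit(X_train, y_train)
r = recall_score(y_test, nb.predict(X_test))
print(round(r, 3))
```

The same pattern applies per genre (fit one model per genre subset) and to genre prediction (swap the popularity labels for genre labels).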

In conclusion, out of all of our classification models, Naive Bayes scored the highest in terms of recall. For track prediction, it scored 57% on predicting unpopularity for tracks across all genres & 64% on predicting popularity. In predicting a track's genre, it scored high in the country and jazz genres, mediocre in r&b, and poorly in latin and pop. Averaging that out, the data we are passing into our models is only returning scores slightly better than random guessing.

Cluster Analysis

DBSCAN

So far, our models have failed to return a satisfactory accuracy score to predict audio features that would indicate a 'popular' track. Instead, we may find better luck in detecting a genre based on audio features. We will use the density based clustering method (DBSCAN) to predict a track's genre. DBSCAN is useful in detecting outliers as well, which will be interesting to visualize if our results are meaningful.

We use DBSCAN instead of kMeans because

  1. DBSCAN is better at detecting outliers, resulting in better, more 'complete' clusters. In other words, we reduce the noise that would've otherwise been seen in kMeans.
  2. Unlike kMeans, DBSCAN doesn't assume clusters are convex shaped.

We remove the 'year' feature since it should be irrelevant to genre prediction.
Our classification methods proved 'duration' to be more useful than we had initially imagined, in particular with predicting popularity within the pop, latin, & jazz genres. This makes sense for the Jazz genre, but we're not sure why that would also apply to the Latin & Pop genres. Thus, we shall keep this feature in the data to see if it'll add any more insights to our findings.

Code source below provided by scikit-learn's DBSCAN demo: https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

min_samples = 10; data = train_matrix
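The epsilon sweep below can be sketched as follows; the matrix here is random data standing in for the scaled track features, so the cluster counts will not match the notebook's.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(10)
train_matrix = rng.random((500, 5))   # stand-in for the scaled audio features

# Sweep epsilon and report cluster/noise counts at min_samples=10.
for eps in np.arange(0.1, 1.0, 0.1):
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(train_matrix)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))   # DBSCAN labels noise as -1
    print(f"eps={eps:.1f}: {n_clusters} clusters, {n_noise} noise points")
```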

DBSCAN with epsilon value of 0.1 returned 0 clusters, with all points considered "noise points".

DBSCAN with epsilon value of 0.2 returned 54 clusters, with 14,118 considered "noise points".

DBSCAN with epsilon value of 0.3 returned 71 clusters, with 5,793 considered "noise points".

DBSCAN with epsilon value of 0.4 returned 76 clusters, with 2,339 considered "noise points".

DBSCAN with epsilon value of 0.5 returned 69 clusters, with 1,110 considered "noise points".

DBSCAN with epsilon value of 0.6 returned 70 clusters, with 708 considered "noise points".

DBSCAN with epsilon value of 0.7 returned 72 clusters, with 555 considered "noise points".

DBSCAN with epsilon value of 0.8 returned 72 clusters, with 492 considered "noise points".

DBSCAN with epsilon value of 0.9 returned 74 clusters, with 430 considered "noise points".

A small min_samples of 10 returned 62 clusters on average; across epsilon values between 0.0 & 1.0, the cluster count ranged from 54 to 74.

min_samples = 50; data = train_matrix

We know we have at least 3,000 track samples per genre, so let's increase min_samples to 50. We chose 50 because at 100 we were getting errors.

As we see in the plot, our accuracy scores did not improve. The last epsilon value of 2.0 with the accuracy score of 100% was because it clustered all points into 1 cluster, making that score invalid. Epsilons > 2.0 were thus not calculated.

A lower epsilon value returned cluster sizes closer to our target cluster size of 5, however, it considered all other points as noise. This leads us to believe we may have too much data.

Tracks per genre in train sample:

min_samples = 50; data = test_matrix

Next, we will test clustering on a smaller sample size - our test data. Our theory is that by reducing our sample size, we might get better completeness and homogeneity scores.

Tracks per genre in test sample:

An epsilon value of 0.7 gave us the best v-measure at 0.078 (a combination of homogeneity & completeness). However, that is still a poor score, with an estimated 13 clusters; an epsilon of 0.5 gave the closest number of clusters to the true count, at 7.

min_samples = 100; data = test_matrix

In conclusion, a higher min_samples gave us better results, but the results were still poor. Minimizing our sample size did not seem to help either. The preferred epsilon value was around 0.7.

kMeans

We use kMeans to explore how well it can predict a song's genre based on its features.
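The fit can be sketched as below, seeding kMeans with one cluster per genre; the matrix is random data standing in for the scaled track features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
X = rng.random((400, 5))   # stand-in for the scaled audio features

# 5 clusters: one per genre (jazz, latin, pop, r&b, country).
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))   # cluster sizes
```

Comparing `km.labels_` against the true genre labels (e.g. with homogeneity and completeness scores, as in the DBSCAN section) is how the cluster quality below would be judged.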

Cluster 2 seems the closest to any real cluster.

Let's see how well it'll predict on our test sample.

Although we were able to give our kMeans algorithm the correct number of genres (aka clusters) to predict on, we still find ourselves with low accuracy scores. We continue to find evidence leading us to conclude that one cannot predict a song's popularity, nor its genre, from the basic features given to us by Spotify.